Abstract


Heart disease is one of the leading causes of death worldwide. Hospitals store
large volumes of clinical data in a variety of systems and medical equipment.
Understanding the facts of heart disease is essential to improving prediction
accuracy. In this paper, experimental evaluations are conducted to assess the
effectiveness of models built with classification algorithms and relevant
attributes selected by the Extra Tree feature selection procedure. Many people
suffer from heart disease globally, and data mining and machine learning
techniques are needed to extract new insights from this data. A number of
classification approaches have been used to analyze medical data sets and
diagnostic problems, including heart disease. However, these methods were
applied only to small, balanced data sets, and their features had to be derived
by trial and error. Feature selection techniques, meanwhile, have been used
extensively in several sectors to enhance classification performance. This paper
proposes a comprehensive approach to enhance the prediction of heart disease
using several machine learning methods, namely Bagging, Support Vector Machine,
Multilayer Perceptron, and Gradient Boost, combined with Extra Tree feature
selection. The experimental results showed improved prediction. When trained on
an 80% data sample, Bagging scored 99.08, 73.19, 67.20, 69.20, and 80.66 for
accuracy, precision, recall, F1-score, and ROC respectively. When each
classifier was tested on the remaining 20% data sample, the Bagging classifier
model achieved the highest accuracy, scoring 92.62, 48.44, 39.63, 41.89, and
66.82 for accuracy, precision, recall, F1-score, and ROC respectively.
1. Introduction:


Advanced statistical and learning techniques, which are relatively new and
promising technologies, may be used to uncover relevant knowledge in databases.
Many scholars are interested in the relatively young and emerging field of
medical data mining and knowledge discovery (Chauhan et al.). With greater
medical data collection, doctors may be able to make more accurate diagnoses.
Cardiovascular diseases have been shown to have the highest death rates among
medical conditions in the majority of nations (Fu et al.).
Analysts can identify illnesses more accurately from disease datasets. Various
researchers have used the UCI repository disease dataset. They examined disease
features with various feature selection techniques and made predictions with
learning algorithms. The authors trained and tested various classifiers to
extract more information from the dataset (Jindal).

The first and most common method for choosing features from labeled data is
supervised feature selection. Supervised feature selection methods use filter,
wrapper, and embedded techniques. In the preprocessing step, Extra Tree methods
are applied independently of the learning algorithm. This method scores each
feature based on statistical measures and its dependence on the class label
(Gnaneswar and Jebarani).


2. Related Work: 


In this paper, we have studied research from previous years on different
datasets. Various authors applied selected machine learning and deep learning
algorithms and tested their performance. Some of the algorithms used are listed
in Table 1. The literature makes it clear that training a classifier on relevant
features chosen by feature selection techniques improves the classifier’s
accuracy.


3. Methodology: 


The ML research community frequently uses the UCI data set. It has 1025 records
in total, with 14 variables: Age, Sex, Chest Pain, Resting Blood Pressure,
Cholesterol, Fasting Blood Sugar, Restecg, Thalach, Exang, Oldpeak, Slope, Ca,
Thal, and Class. These 14 variables have been found to be significantly
associated with heart disease. The descriptions of each variable are shown in
Table 2.


3.1. Data Description: 


The dataset must be preprocessed so that it accurately reflects the quality of
the data. The preprocessing applied to the dataset consists of removing missing
values from the features. Such missing-value handling produces a clean dataset.
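
The removal of missing values described above can be sketched as follows. This is a minimal illustration, not the authors' actual preprocessing code: the function name `drop_incomplete_rows`, the field names, the sample records, and the set of markers treated as missing are all assumptions made for the example.

```python
# Markers treated as "missing" are an assumption for this sketch.
MISSING = {None, "", "?"}

def drop_incomplete_rows(rows):
    """Keep only the records in which every field has a value."""
    return [
        row for row in rows
        if not any(value in MISSING for value in row.values())
    ]

# Hypothetical records; the values are illustrative only.
records = [
    {"age": 63, "chol": 233, "thal": 6},
    {"age": 67, "chol": None, "thal": 3},   # missing cholesterol value
    {"age": 41, "chol": 204, "thal": "?"},  # missing thal value
]

clean = drop_incomplete_rows(records)
print(len(clean), "of", len(records), "records kept")
```

Dropping whole records is the simplest policy; imputing a replacement value is an alternative when too many records would otherwise be lost.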


3.2. Proposed Work: 


With the use of fewer features in a heart disease dataset, the proposed study
aims to improve classification accuracy. Figure 1 shows the classification
scheme for heart disease. The components of the proposed framework are
described in the sections that follow.

FIGURE 1. Representation of proposed work on heart disease using classifiers

The dataset is taken from the UCI repository for training and testing the
classifiers. The dataset has 14 features and 1025 instances. The label of the
output feature (num) is separated into two classes, as shown in Figure 1. The
experimental procedure was designed to assess how well the search algorithms
and strategies worked together when applied to the following classification
models: Gradient Boost, Multilayer Perceptron, Support Vector Machine, and
Bagging. The results of applying the classification models with 10-fold
cross-validation are reported as improvements and decreases in accuracy with
respect to epoch values. Finally, we examined the outcomes of the experiments.
The primary objective, as previously stated, is to improve the ability to
predict heart disease. This study also serves as a helpful guide for picking
the optimum feature selection method for various classification models.
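
The 10-fold cross-validation mentioned above splits the records into ten parts and lets each part serve once as the held-out set. The sketch below shows only the index bookkeeping; the function name `k_fold_indices` is hypothetical, and the folds are contiguous and unshuffled, which is a simplifying assumption (in practice the records are usually shuffled or stratified first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    The first n_samples % k folds receive one extra record so that
    every record is held out exactly once.
    """
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 1025 records, as in the dataset described in the text.
folds = list(k_fold_indices(1025, 10))
for number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(number, len(train_idx), len(test_idx))
```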
TABLE 1. Representation of previous work on heart disease using machine learning

Year  Author                                     Dataset                 Techniques                                    Best result
2019  Ravindhar N et al. (Hasan et al.)          UCI + machine learning  LR, NB, FK-NN, K-Means, and BPNN              BPNN: 98
2020  Magar R et al. (Ravindhar et al.)          UCI + machine learning  LR, SVM, NB, and DT                           LR: 83
2020  Rajdhan A et al. (Rajdhan et al.)          UCI + machine learning  LR, DT, RF, and NB                            RF: 90
2020  Shah D et al. (Rajdhan et al.)             UCI + machine learning  NB, KNN, RF, and DT                           KNN: 90
2021  Jindal H et al. (Shah, Patel, and Bharti)  UCI + machine learning  KNN, LR, and KNN + LR                         KNN: 89
2021  Pandita A et al. (Pandita et al.)          UCI + machine learning  LR, KNN, SVM, NB, and RF                      KNN: 89
2021  Akella A et al. (A. Akella and S. Akella)  UCI + machine learning  GLM, DT, RF = 87.64%, SVM, NN, and KNN        NN: 93
2022  Gupta, C. et al. (Gupta et al.)            UCI + machine learning  RF, DT, and LR                                LR: 92
2022  Truong, V.T et al. (Truong et al.)         UCI + machine learning  AB, ET, LR, MNB, CART, LDA, SVM, RF, and XGM  AB: 90
2022  Abdalrada, AS. et al. (Abdalrada et al.)   UCI + machine learning  SVM, NB, and DT                               DT: 90
2022  Singh, N. et al. (N. Singh and Bhatnagar)  UCI + machine learning  KNN, DT, LR, NB, and SVM                      LR: 92


3.3. Methods Description: 


In this experiment, we used various classification and feature selection
techniques, described below:


3.3.1. Feature Selection Technique: 


The Extra Tree Classifier (Isabona et al.) builds multiple randomised decision
trees from sub-samples of the data without bootstrapping. The resulting feature
scores, [0.24, 0.03, 0.06, 0.11, 0.11, 0.03, 0.04, 0.10, 0.03, 0.09, 0.05, 0.06,
0.04, 0.02], are shown in Figure 2.
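
One way such feature scores can be obtained is sketched below with scikit-learn's `ExtraTreesClassifier`. The paper does not state which library or settings were used, so the library choice, `n_estimators=100`, the random seeds, and the synthetic data from `make_classification` (a stand-in for the heart disease records) are all assumptions of this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the heart disease table: 13 predictors, binary label.
X, y = make_classification(
    n_samples=1025, n_features=13, n_informative=6, random_state=0
)

# bootstrap=False is the scikit-learn default: each tree sees the whole
# training sample and randomness comes from the random split thresholds.
model = ExtraTreesClassifier(n_estimators=100, bootstrap=False, random_state=0)
model.fit(X, y)

# One impurity-based importance score per feature; the scores sum to 1.
importances = model.feature_importances_
ranking = sorted(range(X.shape[1]), key=lambda i: importances[i], reverse=True)
for i in ranking:
    print(f"feature {i}: {importances[i]:.3f}")
```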

The issue of overfitting is avoided. The effectiveness of data mining
techniques decreases when irrelevant features are included in the dataset. The
best feature combinations must be correctly identified before the best
approaches are determined. Accuracy and other performance measures are expected
to increase when the optimum feature combination is applied across the methods.
The process of creating a new set of features from the original features is
known as feature extraction. It combines the original features in order to
lessen the impact of duplication and inconsistency (D. Yadav et al.).
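
To illustrate how a feature combination might be chosen from the scores listed in this section, the sketch below ranks the fourteen reported values and keeps those at or above their mean. The mean cut-off is an assumption made for the example, not a rule stated in the paper, and the indices are 0-based positions in the reported list, since the text does not map the scores to feature names.

```python
# Scores as reported in the text, in their original order.
scores = [0.24, 0.03, 0.06, 0.11, 0.11, 0.03, 0.04,
          0.10, 0.03, 0.09, 0.05, 0.06, 0.04, 0.02]

# Assumed selection rule: keep features scoring at least the mean score.
threshold = sum(scores) / len(scores)

# Rank positions from the highest score to the lowest (ties keep list order).
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
selected = [index for index, score in ranked if score >= threshold]

print("threshold:", round(threshold, 4))
print("selected positions:", selected)
```

A stricter or looser cut-off changes how many features survive, so the threshold is itself a parameter worth validating.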


3.3.2. Machine Learning Classifiers: 


Data classification remains an active area in machine learning. The next
subsections give a quick introduction to some of the recently proposed
algorithms that have been studied in a variety of fields, namely Gradient
Boost, Multilayer Perceptron, Support Vector Machine, and Bagging.


Support Vector Machine: The Support Vector Machine technique is used in this
study to predict human heart disease. We chose the SVM method for the
prediction process because it is expected to provide a higher level of accuracy
than other machine learning algorithms. The accuracy of the algorithm and the
results on the provided data are presented graphically in the section below
(Gangwar and G. K. Pal).


TABLE 2. Representation of heart disease dataset attributes description

S.No  Variable                   Description
1     Age                        Age of the person in years
2     Sex                        Gender: 1 = male, 0 = female
3     Chest Pain                 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
4     Resting Blood Pressure     Blood pressure in mm Hg on hospital admission
5     Cholesterol                Serum cholesterol in mg/dl
6     Fasting Blood Sugar (fbs)  fbs > 120 mg/dl: 1 = true, 0 = false
7     Restecg                    Electrocardiography: 0 = normal, 1 = may be some problem, 2 = definite problem
8     Thalach                    Maximum heart rate
9     Exang                      Exercise-induced angina: 1 = yes, 0 = no
10    Oldpeak                    ST depression induced by exercise
11    Slope                      Slope of the ST segment during peak exercise: 1 = upsloping, 2 = flat, 3 = downsloping
12    Ca                         Number of blood vessels coloured by fluoroscopy; values range from 0 to 3
13    Thallium Scan              A method of analysing blood flow to the heart muscles: 3 = normal, 6 = fixed defect, 7 = reversible defect
14    Class                      The output or dependent variable: 0 = no heart disease; 1, 2, 3, and 4 represent the severity of the heart disease


Bagging: Bagging is used as the ensemble method in this experiment. It is
mainly used to increase the accuracy of machine learning algorithms. Bagging
reduces overfitting and allows its base learners to be trained in parallel.
Statistically, bagging averages the predictions of its learners, which yields
a low-variance result (D. Singh, H. Yadav, and Agrawal).
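
The idea behind bagging can be sketched in a few lines: train each base learner on a bootstrap sample (drawn with replacement) and combine the learners by majority vote. The example below is a toy illustration under stated assumptions, not the configuration used in the experiments: the base learner is a one-feature threshold rule, and the data, the function names, and `n_estimators=25` are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-feature rule: predict 1 when x >= threshold."""
    best_threshold, best_errors = None, len(points) + 1
    for candidate, _ in points:
        errors = sum((x >= candidate) != bool(label) for x, label in points)
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

def fit_bagging(points, n_estimators=25, seed=0):
    """Train each stump on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    return [
        fit_stump(rng.choices(points, k=len(points)))
        for _ in range(n_estimators)
    ]

def predict_bagging(thresholds, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy data: (feature value, label); the values are illustrative only.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
        (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
ensemble = fit_bagging(data)
print(predict_bagging(ensemble, 1.5), predict_bagging(ensemble, 8.5))
```

Because every stump is fitted independently of the others, the loop in `fit_bagging` could run in parallel, and averaging many such learners is what lowers the variance of the combined prediction.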


Multilayer Perceptron Model: The proposed multilayer perceptron (MLP) model was
created with the express intention of reducing the errors recorded in the
confusion matrix and improving the precision with which illness is grouped by
severity. From this point forward, the work uses the multilayer perceptron
model. The proposed MLP is set up so that the input layer handles the training
inputs, the hidden layers carry the weights that are adjusted during training,
and the output nodes group the results into distinct categories (Isabona et
al.).
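
The layer structure described above can be illustrated with a single forward pass through a tiny network. This is only a sketch: the layer sizes, the weights, the sigmoid and softmax activations, and the input values are assumptions chosen for the example, and the weight-adjustment (training) step is not shown.

```python
import math

def dense(inputs, weights, biases):
    """Weighted sum of the inputs for every node in a layer."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def softmax(values):
    peak = max(values)                      # subtract the max for stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> one hidden layer -> output class probabilities."""
    hidden = sigmoid(dense(inputs, hidden_w, hidden_b))
    return softmax(dense(hidden, output_w, output_b))

# Hypothetical weights: 3 inputs -> 2 hidden nodes -> 2 output classes.
hidden_w = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0], [-1.0, 1.0]]
output_b = [0.0, 0.0]

probs = forward([0.6, 0.1, 0.9], hidden_w, hidden_b, output_w, output_b)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```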


Gradient Boosting: Gradient Boost is an ensemble classifier that uses the
output errors to train its learners. It builds a sequence of weak learners,
typically shallow decision trees, in which each new learner is fitted to the
errors left by the learners before it. The weighted outputs of all learners are
added together to produce the final prediction (Gangwar and G. Pal).
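
As commonly defined, gradient boosting adds weak learners one after another, each fitted to the residual errors of the ensemble built so far. The sketch below shows that loop for squared loss with one-split regression stumps and thresholds the final score at 0.5 to obtain a class. It is a simplified illustration under stated assumptions (toy data, squared loss instead of a classification loss, hypothetical function names), not the model used in the experiments.

```python
def fit_stump(xs, residuals):
    """Best single split on one feature, minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict(base, stumps, learning_rate, x):
    """Base value plus the shrunken sum of all stump outputs."""
    return base + learning_rate * sum(
        left if x < threshold else right for threshold, left, right in stumps
    )

def fit_gradient_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # initial constant prediction
    stumps = []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is the residual.
        residuals = [
            y - predict(base, stumps, learning_rate, x) for x, y in zip(xs, ys)
        ]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps

# Toy data; the values are illustrative only.
xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
base, stumps = fit_gradient_boosting(xs, ys)
labels = [int(predict(base, stumps, 0.5, x) >= 0.5) for x in xs]
print(labels)
```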


4. Result: 


In this experiment, we used 80% of the data for training and 20% for testing.
Models built with 10-fold cross-validation are used to evaluate accuracy,
precision, recall, and F-score. The experimental setup used the following
learning techniques: Gradient Boost, Multilayer Perceptron, Bagging, and
Support Vector Machine.
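
The 80/20 split and the evaluation metrics named above can be sketched as follows. The function names `holdout_split` and `binary_metrics`, the random seed, and the example labels are assumptions for illustration; the metrics follow the standard binary definitions based on true and false positives and negatives.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# 1025 record indices, as in the dataset described in the text.
train, test = holdout_split(list(range(1025)))

# Hypothetical true and predicted labels for ten test records.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(len(train), len(test), metrics)
```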


5. Discussion: 


Table 3 shows the performance of all classifiers for k = 6, 8, and 10 folds
after applying the Extra Tree feature selection method. In comparison to the
other classifiers, the findings indicated that Bagging had the best performance
across all assessment metrics. In the experiment, we find that at k = 10 each
classifier performs better for accuracy, precision, recall, F1-score, and ROC.
The Bagging classifier model (BCM) performs better than all other classifiers.
When trained on an 80% data sample, Bagging scored 99.08, 73.19, 67.20, 69.20,
and 80.66 for accuracy, precision, recall, F1-score, and ROC respectively, as
listed in Table 3.

FIGURE 2. Representation of Extra Tree classifier work on heart disease


In the experiment, each classifier was tested on the remaining 20% data sample.
The Bagging classifier model achieved the highest accuracy, scoring 92.62,
48.44, 39.63, 41.89, and 66.82 for accuracy, precision, recall, F1-score, and
ROC respectively.


Our proposed methodology provides doctors with a preliminary diagnosis to guide
further medical care.


We conclude that the methodology can increase the rate at which individuals
with heart disease are identified. As indicated in Table 3, the top approaches
were compared for this work using the dataset, alongside past research on
predicting heart disease. The comparison demonstrated that the Bagging
classifier combined with the Extra Tree feature selection approach produced the
best results.


6. Conclusions: 


In this research, we used a heart disease dataset with 14 attributes and 1025
instances. In order to improve the diagnosis of heart disease, this article
evaluated the performance of a number of classifiers combined with Extra Tree
feature selection. Accuracy, precision, recall, and F-score were among the
assessment criteria employed. According to the data, the accuracy of heart
disease prediction increased to 99% on the training sample with the Extra Tree
feature selection approach using Bagging. More machine learning and deep
learning techniques may be used in the future in conjunction with these feature
selection methods. To enhance the accuracy of heart disease prediction, further
feature selection techniques may be researched.


TABLE 3. Representation of prediction model using heart disease

Predicted training model on 80% sample dataset size of heart disease

Folds  Classifiers  Accuracy     Precision  Recall  F1-Score     ROC_AUC
K=10   GBM          91.57        61.58      37.76   45.22        72.35
       SVM          93.59        59.69      49.21   52.95        62.94
       MLP          92.49        66.65      64.90   67.21        [illegible]
       BCM          99.08        73.19      67.20   69.20        80.66
K=8    GBM          90.17        61.10      40.58   48.09        64.15
       SVM          89.86        70.96      71.11   70.36        82.19
       MLP          93.82        62.78      34.82   44.09        63.60
       BCM          98.68        71.85      68.85   69.65        81.56
K=6    GBM          87.70        51.86      39.63   43.69        67.28
       SVM          [illegible]  60.18      51.51   93.22        69.63
       MLP          87.28        41.54      21.51   40.52        60.88
       BCM          93.25        58.30      30.88   [illegible]  63.91

Predicted testing model on 20% sample dataset size of heart disease

Folds  Classifiers  Accuracy     Precision  Recall  F1-Score     ROC_AUC
K=10   GBM          86.57        61.58      37.76   47.22        72.35
       SVM          82.72        41.48      36.51   31.79        62.11
       MLP          88.85        64.33      22.13   31.66        70.31
       BCM          92.62        48.44      39.63   41.89        66.82